Abstract

A new set of parameters to describe the word frequency behav- 
ior of texts is proposed. The analogy between the word frequency 
distribution and the Bose-distribution is suggested and the notion of 
"temperature" is introduced for this case. The calculations are made 
for English, Ukrainian, and the Guinean Maninka languages. The correlation
between deep language structure (the level of analyticity) and the defined
parameters is shown to exist.
1 Introduction



Quantitative analysis of large text samples has revealed regularities in
the behavior of various text parameters. The empirical laws found in texts,
such as Zipf's law, are known to hold in various domains, in particular the
distribution of nucleotides in genomes and other fields of biology
[1, 2, 3, 4], regularities in social sciences [5], [6], [7], [8, 9], etc.

Approaches from the domain of statistical physics can be used to study
systems composed of many units in general, and texts are suitable for such
studies as well. The application of physical techniques in linguistics is
quite common [10], [11], [12], [13], [14]; other domains are also
successfully covered by physical approaches, cf. [15].

In this work, we analyze the quantitative behavior of texts by drawing an
analogy with a bosonic system in the grand canonical ensemble. In doing so,
we demonstrate the possibility of assigning some new parameters that
characterize the frequency structure of texts, one of which can be
conventionally called "temperature".

The notion of "temperature of texts" has been discussed from different
points of view by several authors. Mandelbrot [16] suggested the name
"informational temperature of texts" for a parameter in a rank-frequency
distribution (known as the Zipf-Mandelbrot law). Such a parameter is related
to "good" or "bad" employment of words, especially rare words [17].
"Temperature" as a measure of communicative ability was introduced in [18].
Recently, Miyazima and Yamamoto [19] used the classical Boltzmann
distribution to define the "temperature of texts" from the frequency data of
the most frequent words. We propose a different approach, mainly addressing
the behavior of low-frequency vocabulary.

The paper is organized as follows. In Section 2 we recall the main notions
used in what follows, namely the principles of compiling a rank-frequency
distribution and the term hapax legomena. Section 3 contains the main part,
where the physical analogy with the Bose-distribution is discussed in detail
and the parameters of the text frequency distribution are given a suitable
interpretation. The results of text analysis in three languages are given in
Section 4, and Section 5 contains a brief discussion of the presented
approach.
2 Rank-frequency distribution



In this work, we analyze texts on the word level. While the notion of "word"
has no unique definition, cf. [20], we restrict ourselves to the so-called
"orthographic word", defined as an alphanumeric sequence between two spaces
or punctuation marks. Different word forms, like 'hand' and 'hands', 'write'
and 'wrote', etc., are treated as different words for simplicity.

To obtain a rank-frequency distribution, one should first compile the 
frequency list from a given sample. Then, the item with the highest frequency 
is given rank 1, the second most frequent item is given rank 2, and so on. 
The items with the same frequency are given a consecutive range of ranks, 
the ordering within which can be arbitrary. 
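
The compilation procedure just described can be sketched in Python as
follows. This is a minimal illustration, not the authors' tooling: the
helper names tokenize and rank_frequency, the case folding, and the choice to
keep word-internal hyphens and apostrophes inside a word are assumptions made
here. Ties are ordered alphabetically, which is one of the arbitrary choices
the procedure allows.

```python
import re
from collections import Counter

# Assumption: an orthographic word is a run of letters/digits; hyphens and
# apostrophes *inside* a word are kept, and case is folded.
WORD_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")

def tokenize(text):
    """Split text into lower-cased orthographic words."""
    return WORD_RE.findall(text.lower())

def rank_frequency(text):
    """Return a list of (rank, word, frequency), most frequent word first."""
    counts = Counter(tokenize(text))
    # Sort by decreasing frequency; equal frequencies get consecutive ranks,
    # ordered alphabetically here (any order within a tie would do).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

sample = "the cat saw the dog and the dog saw the cat"
for rank, word, freq in rank_frequency(sample):
    print(rank, word, freq)
```

The words with equal frequency ('cat', 'dog', 'saw') occupy ranks 2 to 4 and
form a small plateau of the kind discussed below.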

Studies of rank-frequency distributions originate from text analysis.
Although the regularities found there are known to hold in various domains,
texts remain the most easily accessible material, offering a wide variety of
kinds to analyze.

A typical rank-frequency distribution has the shape shown in Fig. 1.




Figure 1: Typical rank-frequency distribution. The absolute frequency f is
shown versus the rank r for orthographic words of Perekhresni stezky [The
Cross-Paths], a Ukrainian novel by Ivan Franko. The data were obtained by the
authors at the preliminary stage of compiling the frequency dictionary of the
novel [21].
Horizontal plateaus in the domain of high ranks / low frequencies correspond
to a large number of words having the same frequency. The longest plateau
corresponds to frequency 1. Such words are known as hapax legomena, the term
originating from Bible studies.

Hapax legomena is the plural of the Classical Greek term hapax legomenon
(ἅπαξ λεγόμενον), translated as '[something] said [only] once'. That is, the
term refers to the words appearing only once in a given sample. Examples
from the Bible include [22]: לילית 'Lilith' (a word of obscure meaning) or
עצי־גפר 'gopher wood' (used to build Noah's Ark). Other often cited examples
are: αὐτόγυον, a kind of plough (Hesiod, Ἔργα καὶ Ἡμέραι [Opera et Dies =
Works and Days], 433); honorificabilitudinitatibus 'the state of being able
to achieve honors' (Shakespeare, Love's Labour's Lost, act 5, scene 1
[23, p. 372]).

For large text samples, about 40 to 60 per cent of the different words
occurring in a text are hapaxes, depending on the text size [24, p. 72]. The
relative number of hapax legomena slightly decreases as the text becomes
longer. Various quantities depending on the text size N are well described
by a power law [25], and the number of hapaxes fits into this family as well,

$$N_{\mathrm{hapax}} = A N^{a}.$$

Note, however, that for statistical studies texts must be sufficiently long. 

Indeed, even in such a long sentence having twenty-three tokens all the
words are hapaxes, except for "hapaxes" themselves since they occur twice.

3 Physical analogy 

The rank-frequency distribution of words in texts has clear similarities
with the Bose-distribution in statistical physics. We suggest identifying
the energy level numbers j with word frequencies (the number of occurrences
in a given text). Thus, the words with frequency 1 occupy the level j = 1,
the words with frequency 2 occupy the level j = 2, etc. The level occupation
then corresponds to the number of different words with the same frequency.
Since the level occupation can reach any value (in particular, significantly
larger than unity), the use of the Bose-distribution is appropriate. The
lowest level corresponds to hapax legomena and in this scheme can be
identified with the Bose-condensate.
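
The mapping from word frequencies to level occupations can be illustrated
with a short Python sketch. The function name level_occupations and the
simple tokenization (lower-cased runs of letters, digits and hyphens) are
assumptions introduced here for illustration only; the input is the example
sentence from the end of Section 2.

```python
import re
from collections import Counter

def level_occupations(tokens):
    """Return {j: N_j}: the number of distinct words occurring exactly j times."""
    word_freq = Counter(tokens)                   # word -> frequency (level j)
    return dict(sorted(Counter(word_freq.values()).items()))

# The illustrative sentence from the end of Section 2.
sentence = ('Indeed, even in such a long sentence having twenty-three tokens '
            'all the words are hapaxes, except for "hapaxes" themselves '
            'since they occur twice.')

# Assumed tokenization: lower-cased runs of letters, digits and hyphens.
tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
occupations = level_occupations(tokens)
total_tokens = sum(j * n_j for j, n_j in occupations.items())
print(occupations, total_tokens)
```

For this sentence the sketch gives 21 words on the level j = 1 (the hapaxes)
and one word on the level j = 2, which accounts for all 23 tokens.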
3.1 Defining energy spectrum

In the Bose-distribution the occupation of the jth level is given by

$$N_j = \frac{1}{z^{-1} e^{\varepsilon_j / T} - 1}, \tag{1}$$

where z is the fugacity, $\varepsilon_j$ is the energy of the jth level, and
T is the temperature.

As shown further, a power energy spectrum gives a proper description for
lower levels,

$$\varepsilon_j = (j - 1)^{\alpha}. \tag{2}$$

Unity is subtracted to ensure that the lowermost level has zero energy.

Due to the nature of the frequency distribution, a simple model of a very
weak log-of-log growth is appropriate for the energy spectrum at high levels,
$\varepsilon_j \propto \ln\ln j$ for $j \gg 1$, cf. Fig. 1. Note, however,
that a log-of-log spectrum requires the maximal number of levels to be
bounded from above by some $j_{\max}$.

3.2 Parameters of the Bose-distribution 

We define the parameters in Eq. (1) in two steps. First, the parameter z,
interpreted as fugacity in physics, is defined from the occupation number of
the lowermost state, i.e., the number of hapax legomena:

$$N_{\mathrm{hapax}} = \frac{z}{1 - z}. \tag{3}$$
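
Eq. (3) can be inverted in closed form, z = N_hapax / (N_hapax + 1). The
Python sketch below illustrates this step; the function names and the hapax
count of 3000 are hypothetical and are not taken from the analyzed texts.

```python
import math

def fugacity_from_hapax(n_hapax):
    """Invert Eq. (3), N_hapax = z / (1 - z), for the fugacity z."""
    return n_hapax / (n_hapax + 1.0)

def occupation(j, z, alpha, T):
    """Bose occupation of level j for the power spectrum eps_j = (j - 1)**alpha."""
    return 1.0 / (math.exp((j - 1) ** alpha / T) / z - 1.0)

n_hapax = 3000                     # hypothetical hapax count, for illustration
z = fugacity_from_hapax(n_hapax)

# At j = 1 the energy is zero, so the occupation reduces to z / (1 - z) and
# reproduces n_hapax whatever alpha and T are (the values below are arbitrary).
print(z, occupation(1, z, alpha=1.5, T=100.0))
```

Since N_hapax is large for any realistic text, z obtained in this way is
always very close to unity.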

"Temperature" T and the exponent $\alpha$ in Eq. (2) are found
simultaneously by fitting the occupation of higher energy levels to

$$N_j = \frac{1}{z^{-1} e^{(j-1)^{\alpha} / T} - 1} \tag{4}$$

via two parameters, $\alpha$ and T. Sample results of the fitting are
presented in Fig. 2. These calculations, as well as the others given later
in this work, were made using the nonlinear least-squares
Marquardt-Levenberg algorithm implemented in the fit procedure of GnuPlot,
version 4.0.
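
For readers without GnuPlot at hand, the two-parameter fit can be imitated
with a small hand-written damped least-squares iteration in the spirit of
the Marquardt-Levenberg method. The sketch below is not GnuPlot's fit
procedure: the function names, the starting guess, the damping schedule and
the synthetic occupation data are all assumptions chosen for illustration.

```python
import math

def occupation(j, z, alpha, T):
    """Eq. (4): N_j = 1 / (z**-1 * exp((j - 1)**alpha / T) - 1)."""
    return 1.0 / (math.exp((j - 1) ** alpha / T) / z - 1.0)

def residuals(levels, data, z, alpha, T):
    return [occupation(j, z, alpha, T) - n for j, n in zip(levels, data)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_alpha_T(levels, data, z, alpha, T, max_iter=200, tol=1e-10):
    """Least-squares fit of (alpha, T), starting from a valid positive guess."""
    lam = 1e-3                                    # damping parameter
    r = residuals(levels, data, z, alpha, T)
    cost = dot(r, r)
    for _ in range(max_iter):
        # Forward-difference Jacobian columns dr/d(alpha) and dr/dT.
        ha, hT = 1e-7 * alpha, 1e-7 * T
        Ja = [(a - b) / ha
              for a, b in zip(residuals(levels, data, z, alpha + ha, T), r)]
        JT = [(a - b) / hT
              for a, b in zip(residuals(levels, data, z, alpha, T + hT), r)]
        # Damped 2x2 normal equations, solved by Cramer's rule.
        m11, m22 = dot(Ja, Ja) * (1.0 + lam), dot(JT, JT) * (1.0 + lam)
        m12, g1, g2 = dot(Ja, JT), dot(Ja, r), dot(JT, r)
        det = m11 * m22 - m12 * m12
        if det <= 0.0:
            break
        d_alpha = (m12 * g2 - m22 * g1) / det
        d_T = (m12 * g1 - m11 * g2) / det
        new_alpha, new_T = alpha + d_alpha, T + d_T
        new_cost = math.inf
        if new_alpha > 0.0 and new_T > 0.0:
            try:
                new_r = residuals(levels, data, z, new_alpha, new_T)
                new_cost = dot(new_r, new_r)
            except OverflowError:                 # step left the usable range
                pass
        if new_cost < cost:                       # accept step, relax damping
            alpha, T, r, cost = new_alpha, new_T, new_r, new_cost
            lam /= 10.0
            if abs(d_alpha) <= tol * alpha and abs(d_T) <= tol * T:
                break
        else:                                     # reject step, damp harder
            lam *= 10.0
            if lam > 1e12:
                break
    return alpha, T

# Synthetic check: generate occupations of levels j = 2..20 from known
# parameters and recover them (all numbers are invented for illustration).
z = 3000.0 / 3001.0
levels = list(range(2, 21))
data = [occupation(j, z, 1.5, 500.0) for j in levels]
alpha_fit, T_fit = fit_alpha_T(levels, data, z, alpha=1.0, T=100.0)
print(alpha_fit, T_fit)
```

The level j = 1 is excluded from the fit because it has already been used to
fix z through Eq. (3).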

One should note that the parameter T is dimensionless in our case, as is
the energy $\varepsilon_j$. This definition differs, e.g., from [19], where
the distribution of some standard text was used to set the reference
temperature in kelvins.
Figure 2: (Color online) The fit of the power energy spectrum to the level
occupations. Blue crosses correspond to the data obtained by the authors on
the basis of the first 40 chapters (of 60 in total) of the text mentioned in
the caption of Fig. 1. The solid line is the fitting curve (4) for the first
20 values of the occupation numbers $N_j$.

The state with T = 0 corresponds to all the frequencies being equal to
unity, that is, the whole text is composed of hapax legomena. This could be
the case for a very short text, not longer than just one or a couple of
sentences (cf. the example at the end of Section 2).

Presently, we fit the first 10-20 levels using the power excitation
spectrum (2). Higher levels are neglected since a different dependence on j
must be applied to ensure good fitting of the occupation data $N_j$, as
suggested in the previous subsection. The parameter T obtained in such a way
scales (very precisely) as $N^{\beta}$ ($\beta < 1$). The scaling is related
to the definition of the "thermodynamic limit" for the problem under
consideration. Just to recall, in a system of N bosons trapped in a
D-dimensional harmonic oscillator potential with frequency $\omega$, the
thermodynamic limit is given by $\omega N^{1/D} = \mathrm{const}$ as
$N \to \infty$, $\omega \to 0$ [27]. Since $\omega$ (or $\hbar\omega$ if
Planck's constant $\hbar$ is not set equal to unity) is a natural unit for
the oscillator energy, power-like scaling of the quantities measured in
energy units can be expected for systems with a power energy spectrum as
well.

Curiously, the ratio $\ln T / \ln N$ varies only insignificantly with the
size of the text sample (for a sufficiently long text). This makes it a good
variable for comparative linguistic studies.
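
The near-constancy of this ratio is easy to check against the tabulated
values. The following Python sketch uses the (N, T) pairs of the first three
Moby-Dick samples copied from Table 1; the function and variable names are
illustrative assumptions.

```python
import math

# (N, T) pairs for the first three Moby-Dick samples, copied from Table 1.
samples = [(5942, 470.4), (39363, 1773.3), (66916, 2639.7)]

def log_ratio(n_tokens, temperature):
    """Return ln T / ln N for a text sample of n_tokens words."""
    return math.log(temperature) / math.log(n_tokens)

ratios = [log_ratio(n, t) for n, t in samples]
reduced = [t / n for n, t in samples]             # T/N, for comparison

for (n, t), ratio, per_token in zip(samples, ratios, reduced):
    print(f"N = {n:6d}   ln T / ln N = {ratio:.3f}   T/N = {per_token:.4f}")
```

While T/N drops by about a factor of two between the first and the third
sample, the logarithmic ratio changes only in the third decimal place.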



4 Some results 

So far, we have analyzed some texts written in English (a Germanic
language), Ukrainian (a Slavic language), and Guinean Maninka (in the Nko
script; a language from the Mande family). Such a broad choice is meant to
test the approach on significantly different language material in order to
reveal both universal and unique features of the parameter behavior.

Fig. 3 demonstrates the "temperature" behavior of an English text
(Moby-Dick by Herman Melville) and two novels in Ukrainian (Perekhresni
stezky [The Cross-Paths] by Ivan Franko and Sobor [The Cathedral] by Oles
Honchar).




Figure 3: (Color online) The behavior of "temperature" as the size of the
text grows (MD: Moby-Dick; PS: Perekhresni stezky; So: Sobor). The lines
correspond to the linear fits of the data represented by the respective
symbols.



Table 1 shows the numerical data on the parameter T calculated by taking
increasing shares of chapters. The data based on an article from the
Guinean journal Yelen are also given.

The values of z in all the cases are close to 1. A better resolution might
be achieved by introducing an analog of the chemical potential $\mu$ in the
standard way, $z = e^{\mu / T}$.
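
To see why the chemical potential resolves what z does not, one can combine
this definition with Eq. (3). The short derivation below is an added
illustration, not part of the original analysis:

```latex
z = \frac{N_{\mathrm{hapax}}}{N_{\mathrm{hapax}} + 1}
\quad\Longrightarrow\quad
\mu = T \ln z
    = -T \ln\!\left(1 + \frac{1}{N_{\mathrm{hapax}}}\right)
    \approx -\frac{T}{N_{\mathrm{hapax}}}
\qquad (N_{\mathrm{hapax}} \gg 1).
```

Thus z differs from unity only by about 1/N_hapax, whereas the analog of the
chemical potential is a small negative number of order T/N_hapax that
combines both fitted quantities.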

The fitting gives values of the exponent $\alpha$ that slightly decrease as
the text size grows. An interpretation in terms of an external potential can
be applied to justify such a change. Indeed, if the presence of an external
potential is treated in the semiclassical approach [26], decreasing values
of the exponent in a power excitation spectrum effectively correspond to a
weakening of the steepness of the external potential. That is, as a text
becomes longer, it suffers less from some external influences.

Indeed, in one dimension a power energy spectrum
$\varepsilon_p \propto p^{\alpha}$ leads to the density of states

$$g(\varepsilon) \propto \varepsilon^{1/\alpha - 1}. \tag{5}$$

On the other hand, non-interacting particles confined in a trapping
potential $U(x) \propto x^{\eta}$ have, in the semiclassical approach [26],
the density of states

$$g(\varepsilon) \propto \varepsilon^{\frac{1}{\eta} - \frac{1}{2}}.$$

Note that a rigid box corresponds to $\eta \to \infty$.

Thus, an effectively occurring exponent $\alpha$ is related to $\eta$ via

$$\alpha = \frac{2\eta}{\eta + 2},$$

leading to $\alpha = 3/2$ for $\eta = 6$ and $\alpha = 1$ for $\eta = 2$.
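
The intermediate step behind this relation is the matching of the exponents
in the two densities of states; spelled out as an added worked equation, it
reads:

```latex
\frac{1}{\alpha} - 1 = \frac{1}{\eta} - \frac{1}{2}
\quad\Longrightarrow\quad
\frac{1}{\alpha} = \frac{1}{\eta} + \frac{1}{2} = \frac{\eta + 2}{2\eta}
\quad\Longrightarrow\quad
\alpha = \frac{2\eta}{\eta + 2},
\qquad
\lim_{\eta \to \infty} \alpha = 2 .
```

The limiting value of 2 for the rigid box agrees with the familiar quadratic
spectrum of a particle in a box, and a harmonic trap (exponent 2 in the
potential) gives the linear spectrum of the oscillator.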

Preliminarily, the obtained values of T and of the exponent $\alpha$
correlate with the analyticity level of the language. Lower values
correspond to higher analyticity (less word inflection), as can be seen from
the opposition between English and Ukrainian (both Indo-European languages).
So far, we do not have sufficient data to make further statements, in
particular, for the language from an unrelated language family (Mande), the
data for which are given for curiosity and future reference. As should be
expected, a low value of $\alpha$ for the Maninka sample suggests a high
level of analyticity.

Finally, in Fig. 4 we present the results of the "temperature" calculation
made for short Ukrainian texts of different genres [28]. Close values
indicate a weak genre dependence of this parameter. A multivariate
discriminant analysis is required to study this issue in more detail.
Figure 4: (Color online) The behavior of $\ln T / \ln N$ for texts from
different genres. Open letters, private letters, and sermons are shown.



5 Brief discussion 

The presented results are a preliminary attempt to define a new set of
parameters describing the frequency structure of texts. Further application
of this approach to a larger number of texts from different languages
written by different authors is required to establish the correlation
between parameter values and text language/authorship. The shape of the
spectrum in the
whole domain of the variation of j must be considered in further studies to 
give a proper description of the level occupation. Also, more parameters can
be calculated within the "thermodynamic approach" (like some analogs of
total energy, specific heat, etc., cf. [18]). One of the outcomes we expect
from such calculations is the possibility of automatic text attribution,
useful for automated language processing. Applications beyond linguistics
(in genetics, social sciences, etc.) are also possible in the future.
Table 1: The parameters of the "energy spectrum" and "temperature" of texts

Text                                        N        $\alpha$   T        $\ln T / \ln N$   T/N

Moby-Dick (ENG)                             5942     1.97       470.4    0.708             0.0792
                                            39363    1.60       1773.3   0.707             0.0451
                                            66916    1.56       2639.7   0.709             0.0394
                                            107503   1.48       3622.3   0.707             0.0337
                                            132968   1.48       4207.3   0.707             0.0316
                                            165746   1.48       4968.4   0.708             0.0300
                                            191040   1.47       5476.5   0.708             0.0287
                                            215270   1.45       5791.3   0.706             0.0269

Перехресні стежки (The Cross-Paths) (UKR)   343      1.57       26.6     0.562             0.0774
                                            1052     2.03       119.1    0.687             0.1132
                                            12949    1.68       812.0    0.708             0.0627
                                            28010    1.73       1610.1   0.721             0.0575
                                            40811    1.72       2270.7   0.728             0.0556
                                            54361    1.70       2964.3   0.733             0.0545
                                            70330    1.64       3597.4   0.734             0.0512
                                            96083    1.57       4561.4   0.734             0.0475

Yelen (The Light) journal (NKO)             429      1.42       45.2     0.629             0.1053